FOREWORD


A simple selective learning algorithm for use with multilayer perceptrons
(MLPs) is presented. This algorithm has proved useful in certain types of
problems where learning failure occurs using standard back propagation.
Examples of these problems are included. The algorithm is based on the rms
output error, computed across all output nodes and all training patterns.
The learning rate is decreased for each individual output node whose error
is less than a user-chosen multiple of the rms error from the previous pass.
This algorithm has produced convergence where standard fixed-gain back
propagation failed.
INTRODUCTION

The discovery of a back propagation rule for hidden layers in multilayer
perceptrons by Werbos [1], and its subsequent rediscovery by Rumelhart et al. [2],
has been fundamental to the resurgence of interest in feed forward networks
that has occurred in recent years. Several notable successes by Sejnowski and
Rosenberg, Gorman and Sejnowski, and Tesauro and Sejnowski [5] have contributed
to the interest and spurred a great deal of optimism in the field. As Minsky
and Papert have pointed out, however, error back propagation based on gradient
descent is not without shortcomings and limitations.

The primary problem we have encountered occurs when one of the classes of
training patterns is under-represented in the training set. The symptom is
one or more patterns for which the output error converges to 1.0 on one or
more of the output units. Sometimes this can be circumvented by adjusting the
gain and/or momentum term or by varying the network size. In our experience,
there are some pathological cases where these measures do not solve the
convergence problem.

A  related  problem  that  is  perhaps  easier  to  visualize  is  one  where  there 
are  a  large  number  (we  assume  20  for  this  example)  of  output  nodes,  with  each 
unit  corresponding  to  a  distinct  class.  For  each  input  pattern,  the  correct 
output  will  consist  of  a  single  output  unit  "on"  with  a  value  near  1.0,  and 
with  the  rest  "off"  with  output  values  near  0.  It  may  happen  that  there  is  a 
local  minimum  for  one  set  of  inputs  where  all  output  units  are  "off."  As  an 
example, with 20 output units all "off" with values of zero, the average error
is 1/20 = 0.05, which is close to the true minimum error of zero. As the number
of output units is increased, the average or rms error decreases, so that a
local minimum close to the true minimum becomes more likely.
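
The arithmetic in this example can be checked with a short sketch. It assumes a one-hot target and outputs that are exactly zero; the function name all_off_errors is hypothetical, and the rms value is computed for a single pattern rather than across a whole training set.

```python
import math

def all_off_errors(n_outputs):
    """Return (mean absolute error, rms error) for one pattern whose target is
    one-hot but whose outputs are all exactly zero (the "all off" state)."""
    target = [1.0] + [0.0] * (n_outputs - 1)
    output = [0.0] * n_outputs
    errors = [t - o for t, o in zip(target, output)]
    mean_abs = sum(abs(e) for e in errors) / n_outputs
    rms = math.sqrt(sum(e * e for e in errors) / n_outputs)
    return mean_abs, rms

# Both measures shrink as output units are added, even though the one unit
# that should be "on" is completely wrong in every case.
for n in (5, 20, 100):
    mean_abs, rms = all_off_errors(n)
    print(f"{n:4d} outputs: mean |error| = {mean_abs:.3f}, rms = {rms:.3f}")
```

With 20 outputs the mean absolute error is 1/20, matching the value quoted above, while the single wrong unit still has an error of 1.0.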

We have developed and successfully used a simple selective learning
algorithm that steers the network away from these local minima by
distorting the error surface. It is presented in the following section. Some
discussion of its use, with two examples, is included in the last section along
with some closing comments.


SELECTIVE  LEARNING  ALGORITHM 


BACK PROPAGATION RULE

Training the multilayer perceptron is divided into a feed forward step,
which propagates the input pattern up through the net structure until the
output nodes are reached, and an error back propagation step, which modifies
the edge weights of the network to improve the performance of the network
on the next presentation of the pattern [2]. These steps are executed repeatedly
for the set of input patterns until the desired answers are obtained at the
output nodes. The steps in the feed forward process are as follows. For the
j-th node a weighted sum of its inputs is computed


$$ T_{pj} = \sum_i w_{ji} \, o_{pi} + \theta_j $$


In  the  next  step  the  sigmoid  transfer  function  is  applied  to  this  weighted  sum 


$$ o_{pj} = S(T_{pj}) = \frac{1}{1 + \exp(-\alpha T_{pj})} $$
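
The feed forward step for one layer can be sketched as follows. This is a minimal illustration, not the authors' code: the function names sigmoid and feed_forward_layer, the nested-list weight layout, and the default slope of 1.0 are all assumptions.

```python
import math

def sigmoid(x, alpha=1.0):
    """Sigmoid transfer function S with slope parameter alpha.
    Note: math.exp overflows for very large negative alpha * x."""
    return 1.0 / (1.0 + math.exp(-alpha * x))

def feed_forward_layer(inputs, weights, offsets, alpha=1.0):
    """Compute the outputs of one layer for a single pattern.

    inputs[i]     : output of node i in the layer below
    weights[j][i] : weight from node i below to node j in this layer
    offsets[j]    : offset of node j
    """
    outputs = []
    for w_j, theta_j in zip(weights, offsets):
        weighted_sum = sum(w * o for w, o in zip(w_j, inputs)) + theta_j
        outputs.append(sigmoid(weighted_sum, alpha))
    return outputs

# Two inputs feeding two nodes; values chosen only for illustration.
print(feed_forward_layer([1.0, 0.0], [[0.5, -0.5], [0.0, 0.0]], [0.0, 0.0]))
```

Applying feed_forward_layer once per layer, with each layer's outputs used as the next layer's inputs, propagates a pattern up to the output nodes.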

In  the  back  propagation  phase,  edge  weight  corrections  are  computed. 
First  the  error  for  each  output  node  j  is  computed 


$$ \delta_{pj} = (t_{pj} - o_{pj}) \, o_{pj} (1 - o_{pj}) $$


Next  the  error  term  for  each  hidden  unit  j  is  computed. 


$$ \delta_{pj} = o_{pj} (1 - o_{pj}) \sum_k \delta_{pk} w_{kj} $$

The δ_pk are the error contributions from the nodes above node j in the
network. Once all the error terms are computed the edge weights are adjusted
according to


$$ w_{ji}(t + 1) = w_{ji}(t) + \epsilon \, \delta_{pj} \, o_{pi} + \eta \left( w_{ji}(t) - w_{ji}(t - 1) \right) $$

Weights are adjusted after each presentation of each pattern. The offsets for
the  hidden  and  output  units  are  treated  as  edge  weights  from  units  which  have 
a  constant  value  of  1. 
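
The back propagation phase for a single pattern can be sketched as below. The function names, the nested-list layout, and the default learning rate and momentum values are assumptions made for illustration; offsets are handled by appending a constant 1.0 to the input list, as described above.

```python
def output_deltas(targets, outputs):
    """Error term for each output node: (t - o) * o * (1 - o)."""
    return [(t - o) * o * (1.0 - o) for t, o in zip(targets, outputs)]

def hidden_deltas(hidden_outputs, deltas_above, weights_above):
    """Error term for each hidden node j.

    weights_above[k][j] is the weight from hidden node j to node k above it.
    """
    result = []
    for j, o in enumerate(hidden_outputs):
        back = sum(d_k * w_k[j] for d_k, w_k in zip(deltas_above, weights_above))
        result.append(o * (1.0 - o) * back)
    return result

def update_weights(weights, prev_change, deltas, inputs, eps=0.5, eta=0.9):
    """Adjust weights in place using the learning rate eps and momentum eta.

    prev_change[j][i] holds the previous change w_ji(t) - w_ji(t - 1) and is
    overwritten with the change applied in this call.
    """
    for j, delta_j in enumerate(deltas):
        for i, o_i in enumerate(inputs):
            change = eps * delta_j * o_i + eta * prev_change[j][i]
            weights[j][i] += change
            prev_change[j][i] = change
```

Storing the previous change, rather than the previous weight, is a design choice made here for brevity; the two are equivalent because the momentum term depends only on the difference between successive weights.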


SELECTIVE  LEARNING  FOR  BACK  PROPAGATION 

The fundamental idea of our algorithm is to selectively deweight the
output errors for certain output units in a dynamic manner. This is
accomplished by multiplying the output error by a constant (<1) for those
output units whose output error is less than a user-specified multiple of the
rms error σ, computed over all of the output units for all training patterns.

The  two  user  inputs  are: 

ξ = scaling factor for the output error term,
ρ = multiple of the rms error.

The  steps  in  the  algorithm  are: 

1. For the first pass through all of the training patterns,
the output error for all output units is decreased by

$$ (t_{pj} - o_{pj}) \rightarrow \xi \, (t_{pj} - o_{pj}), \quad \forall p, j. $$

This effectively decreases the gain for all corrections by a
factor of ξ.

2. All weight and offset adjustments are carried out as usual
(see The Back Propagation Rule).

3. The rms error σ is computed for the current pass.

4. The next pass is carried out with selective learning. For
each output unit for each pattern:

$$ \forall p, j: \quad \text{if } |t_{pj} - o_{pj}| < \rho \sigma \text{ then } (t_{pj} - o_{pj}) \rightarrow \xi \, (t_{pj} - o_{pj}). $$

5. All weight and offset adjustments are carried out as usual
based on the reduced output errors.

6. Repeat starting at step 3.
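
The error-scaling part of the steps above can be sketched as follows. The function names rms_error and selective_errors, the argument names, and the example values are assumptions for illustration; the scaled errors would then be used in place of the raw output errors when the output error terms are computed.

```python
import math

def rms_error(errors_by_pattern):
    """rms of the output errors over all output units and all patterns."""
    flat = [e for pattern in errors_by_pattern for e in pattern]
    return math.sqrt(sum(e * e for e in flat) / len(flat))

def selective_errors(targets, outputs, sigma_prev, xi, rho, first_pass=False):
    """Return the output errors for one pattern after selective deweighting.

    On the first pass every error is multiplied by xi. On later passes only
    errors with magnitude below rho * sigma_prev are multiplied by xi, where
    sigma_prev is the rms error computed for the previous pass.
    """
    scaled = []
    for t, o in zip(targets, outputs):
        err = t - o
        if first_pass or abs(err) < rho * sigma_prev:
            err *= xi
        scaled.append(err)
    return scaled

# The badly wrong first unit keeps its full error; the others are deweighted.
print(selective_errors([1.0, 0.0, 0.0], [0.1, 0.1, 0.2],
                       sigma_prev=0.5, xi=0.1, rho=1.0))
```

In a full training loop, the raw (unscaled) errors from each pass would be collected and passed to rms_error to obtain sigma_prev for the next pass.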
APPLICATION STRATEGIES

There is generally no point in employing selective learning until a
training problem has been encountered. Once it is clear that the problem is
not simply too small a network, selective learning should be considered
as a means of solution. An approach that has been found to work very well is
to start with a "brain," or set of connection weights and offsets, developed
using standard back propagation that has a maximum error of ≈0.9. The
parameter ρ can then be chosen so as to reduce the error terms for those
output units whose errors are less than ≈0.8. The goal is to set ρ so that
only the few worst output errors are unreduced. As an extreme example, by
increasing ρ it is possible to reduce learning for all but one of the output
units on one of the training patterns. A good rule of thumb for an initial
value of ξ is either the ratio of problem training patterns to good training
patterns, or the inverse of the number of output units in the case where only
a single output unit per training case is to be on. In either case, it is
often necessary to tweak the value somewhat. As the maximum error decreases, σ
may increase or decrease, and it is usually necessary to modify ρ occasionally.
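
These rules of thumb reduce to simple arithmetic, sketched below. The helper names and the example values are hypothetical; the only relationship taken from the text is that errors below ρσ are reduced, so a desired error cutoff implies ρ = cutoff / σ.

```python
def initial_rho(error_cutoff, sigma):
    """Multiple of the rms error such that errors below error_cutoff are reduced."""
    return error_cutoff / sigma

def initial_xi_from_ratio(n_problem_patterns, n_good_patterns):
    """Starting scaling factor: ratio of problem patterns to good patterns."""
    return n_problem_patterns / n_good_patterns

def initial_xi_one_hot(n_output_units):
    """Starting scaling factor when exactly one output unit should be on."""
    return 1.0 / n_output_units

# Illustrative values only: a cutoff of 0.8 with an assumed rms error of 0.2.
print(initial_rho(0.8, 0.2))
print(initial_xi_one_hot(20))
```

Because σ changes as training proceeds, initial_rho would need to be re-evaluated from time to time to keep the same effective cutoff.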


RESULTS  AND  CONCLUSIONS 

This  selective  learning  algorithm  has  been  successfully  applied  to 
training  a  connectionist  expert  system  and  to  radar  data  processing  with 
impressive  results. 

In the connectionist expert system application, the training set
consisted  of  722  input  patterns  or  rules,  with  a  network  configuration  of 
23  input  units,  104  and  52  units  in  the  first  and  second  hidden  layers, 
respectively,  and  18  output  units.  Each  output  unit  corresponded  to  a 
particular  decision.  Thus  only  a  single  output  unit  should  be  on,  all  others 
being  off.  Since  each  of  the  input  patterns  represented  a  rule  from  the 
722-rule  rule  base,  it  was  unacceptable  to  have  any  of  the  training  patterns 
learned  incorrectly.  Also,  it  was  not  possible  to  assure  that  all  classes  of 
rules  be  adequately  represented  in  the  training  set.  Several  of  the  classes 
were  underrepresented,  leading  to  learning  problems  with  the  standard  back 
propagation. 
No amount of changing the network size or the values of the gain or
momentum  parameters  resulted  in  a  network  that  could  successfully  learn  all  of 
the  rules.  There  were  always  at  least  two  failures.  In  an  expert  system  of 
any  type,  this  is  clearly  unacceptable.  With  the  selective  learning  employed, 
the  network  was  able  to  successfully  learn  all  of  the  training  patterns. 

The other major application for which we have used the algorithm is the
processing of 3-dB signal/noise radar data. It has been used to boost
performance in learning the training set of 1600 input data sets consisting of
160 input values each. It has proven to be the only method of successfully
learning all of the training patterns. (As expected, this decreases the
generalization performance.) It has also been used to reduce training times
by a factor of five when the goal was to achieve the same training results as
with standard back propagation.

In  summary,  the  described  selective  learning  algorithm  has  proven  useful 
in  cases  where  standard  error  back  propagation  fails.  It  is  probably  most 
useful  where  accurate  memorization  of  training  patterns  is  required.  In  cases 
such  as  the  radar  signal  processing,  where  it  is  not  desirable  to  memorize  all 
of  the  noisy  training  data  and  over  training  can  occur,  its  utility  is  much 
less  clear  cut.  Nonetheless,  it  resulted  in  faster  training  with  comparable 
generalization  results  when  employed  as  a  training  accelerator  in  the  radar 
case.
GLOSSARY

o_pj = Output of the j-th node on the p-th pattern
θ_j  = Offset of the j-th node
S    = Sigmoid function
t_pj = Desired output of the j-th node on the p-th pattern
T_pj = Intermediate result for the j-th node on the p-th pattern
w_ji = Weight from i-th node in layer n to j-th node in layer n + 1
α    = Sigmoid slope parameter
ε    = Learning rate
η    = Rate of momentum transference
ξ    = Scaling factor for the output error term in selective learning
ρ    = Multiple of the rms error used in selective learning